refactor: logical instead of physical extension array#6409
refactor: logical instead of physical extension array#6409universalmind303 wants to merge 1 commit intomainfrom
Conversation
Greptile SummaryThis PR refactors Key changes:
Confidence Score: 3/5
Important Files Changed
Class Diagram%%{init: {'theme': 'neutral'}}%%
classDiagram
class ExtensionArray {
+Arc~Field~ field
+Arc~str~ extension_name
+Option~Arc~str~~ metadata
+Series physical
+new(field, physical) Self
+to_arrow() DaftResult~ArrayRef~
+slice(start, end) DaftResult~Self~
+concat(arrays) DaftResult~Self~
+with_physical(physical) Self
+rename(name) Self
}
class Series {
+Arc~dyn SeriesLike~ inner
}
class ExtensionGrowable {
+String name
+DataType dtype
+Box~dyn Growable~ physical_growable
+extend(index, start, len)
+add_nulls(additional)
+build() DaftResult~Series~
}
class ExtensionType {
+get_dtype() DataType
ArrayType: ExtensionArray
}
class DaftDataType {
<<trait>>
+get_dtype() DataType
+ArrayType
}
class SeriesLike {
<<trait>>
+to_arrow() DaftResult~ArrayRef~
+cast(datatype) DaftResult~Series~
+filter(mask) DaftResult~Series~
+agg_list(groups) DaftResult~Series~
}
DaftDataType <|-- ExtensionType
SeriesLike <|.. ArrayWrapper~ExtensionArray~
ExtensionArray --> Series : wraps physical
ExtensionGrowable --> ExtensionArray : builds
ExtensionType --> ExtensionArray : ArrayType
Last reviewed commit: 71bf5b1 |
| ); | ||
| let is_valid = self.is_valid(idx); | ||
| if is_valid { | ||
| if self.physical.is_valid(idx) { |
There was a problem hiding this comment.
Potential panic on to_arrow() failure
to_arrow() on ExtensionArray is now fallible — it performs a field lookup (which can fail if the extension type has no Arrow representation) and potentially a cast. The previous implementation was infallible (it returned the inner Arrow array directly). Using unwrap() here will panic at runtime if the cast fails, rather than returning None gracefully.
Consider propagating the error or returning None:
| if self.physical.is_valid(idx) { | |
| let scalar = Scalar::new(scalar.to_arrow().ok()?); |
| fn agg_list(&self, groups: Option<&GroupIndices>) -> DaftResult<Series> { | ||
| self.0.physical.agg_list(groups) | ||
| } | ||
|
|
||
| fn agg_set(&self, groups: Option<&GroupIndices>) -> DaftResult<Series> { | ||
| self.0.physical.agg_set(groups) | ||
| } | ||
|
|
There was a problem hiding this comment.
agg_list and agg_set discard extension type info
Both agg_list and agg_set delegate directly to self.0.physical, which returns a List<storage_type> (e.g., List<Binary>) instead of List<Extension(...)>. This silently strips the extension metadata from the result, causing type mismatches for any downstream code that expects List<Extension> from aggregating an ExtensionArray.
min and max have the same issue — they return the physical type rather than an ExtensionArray. For completeness, agg_list/agg_set at minimum should wrap the result back with the extension metadata since they produce a list aggregation of the same type.
| DataType::Extension(..) => { | ||
| let physical = map.next_value::<Series>()?; | ||
| let physical = physical.to_arrow().unwrap(); | ||
|
|
||
| Ok(ExtensionArray::from_arrow(Arc::new(field), physical) | ||
| .unwrap() | ||
| .into_series()) | ||
| Ok(ExtensionArray::new(Arc::new(field), physical).into_series()) |
There was a problem hiding this comment.
Serialization format change breaks backward compatibility with old data
The old serialize for ExtensionArray stored the physical series with the hardcoded field name "physical":
Field::new("physical", inner.as_ref().clone())The new serialize stores self.physical directly, which has the actual column name as its field name.
When deserializing old-format data with this new code, map.next_value::<Series>() would produce a series named "physical", while field.name is the actual column name (e.g., "my_col"). The resulting ExtensionArray is then created with a field named "my_col" and a physical series named "physical" — a silent mismatch.
This mismatch propagates to any code that calls self.physical.name() expecting the column name (e.g., display, schema inspection). Consider validating that physical.name() == field.name in ExtensionArray::new, or documenting this constraint.
| fn if_else(&self, other: &Series, predicate: &Series) -> DaftResult<Series> { | ||
| let other_physical = match other.downcast::<ExtensionArray>() { | ||
| Ok(other_ext) => &other_ext.physical, | ||
| Err(_) => other, | ||
| }; | ||
| Ok(self | ||
| .0 | ||
| .with_physical(self.0.physical.if_else(other_physical, predicate)?) | ||
| .into_series()) | ||
| } |
There was a problem hiding this comment.
if_else with a non-extension other may silently produce wrong types
When other is not an ExtensionArray, the code falls back to using other directly:
Err(_) => other,This passes the raw other series (of an arbitrary type) into self.0.physical.if_else(other_physical, predicate). While the inner if_else may fail with a type mismatch error, this isn't immediately obvious from the call site. It would be clearer to return an explicit DaftError::TypeError when other is not an ExtensionArray, rather than falling through to a potentially confusing error from the physical layer.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #6409 +/- ##
==========================================
- Coverage 74.78% 74.68% -0.10%
==========================================
Files 1020 1023 +3
Lines 136319 136557 +238
==========================================
+ Hits 101949 101994 +45
- Misses 34370 34563 +193
🚀 New features to boost your workflow:
|
Changes Made
refactors ExtensionArray into a logical wrapper instead of a physical type. This closely aligns with how arrow-rs and pyarrow treat extensions (purely metadata, not a dedicated datatype)
Related Issues
Closes #6408